Collaborator

@nikki-t nikki-t commented Jul 10, 2025

Purpose

  • Modify the SPS deployment to reduce operational costs by moving the DB instance to a smaller EC2 instance and reducing the number of nodes instantiated for the various Airflow and OGC components.

Proposed Changes

  • [ADD] db_instance_class variable to control the EC2 instance selected for RDS
  • [CHANGE] Set node affinity for the Airflow core components and the OGC API to prefer the r5 instance family with 8 CPUs, and make the setting consistent across all of them
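
For readers unfamiliar with the two changes listed above, here is a minimal Terraform sketch of what they could look like. It is an illustration only, not the code merged in this PR: the default instance class, the resource and local names, the Postgres engine, and the idea of passing the affinity to a Helm chart as YAML are all assumptions. The only details taken from the PR are the db_instance_class variable name and the preference for the r5 family with 8 CPUs. The affinity keys are Karpenter's well-known AWS node labels.

```hcl
# Sketch only. Everything except the variable name db_instance_class is hypothetical.
variable "db_instance_class" {
  description = "RDS instance class for the SPS database; smaller classes cost less."
  type        = string
  default     = "db.t3.medium" # assumed value, the PR does not state the default
}

resource "aws_db_instance" "example" {
  identifier                  = "sps-db-example" # hypothetical name
  engine                      = "postgres"       # assumed engine
  instance_class              = var.db_instance_class
  allocated_storage           = 20
  username                    = "example_admin"
  manage_master_user_password = true
  skip_final_snapshot         = true
}

locals {
  # Soft preference (not a hard requirement) shared by all core components,
  # so the scheduler favors r5 nodes with 8 vCPUs but can fall back to others.
  core_node_affinity = {
    nodeAffinity = {
      preferredDuringSchedulingIgnoredDuringExecution = [
        {
          weight = 1
          preference = {
            matchExpressions = [
              { key = "karpenter.k8s.aws/instance-family", operator = "In", values = ["r5"] },
              { key = "karpenter.k8s.aws/instance-cpu", operator = "In", values = ["8"] },
            ]
          }
        }
      ]
    }
  }
}

# Rendered as YAML so the same block could be reused in each component's Helm values.
output "core_node_affinity_yaml" {
  value = yamlencode({ affinity = local.core_node_affinity })
}
```

Defining the affinity once and reusing it is one way to keep the setting identical for every component, which is the consistency goal stated in the second bullet.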

Issues

  • #419 - [Task]: Reduce SPS Operational Costs

Testing

  • Deployed to unity-venue-dev under nikki-3 and ran integration tests: https://github.com/unity-sds/unity-sps/actions/runs/16196483391/job/45733952614
  • The integration tests for the OGC API failed in that run with unity_sps_ogc_processes_api_python_client.exceptions.NotFoundException: (404); Reason: Not Found, so I ran them on my local laptop, where they completed successfully.
  • Executed Terraform on local laptop and confirmed the DAGs were deployed to the nikki-3 deployment on unity-venue-dev.

@nikki-t nikki-t requested a review from LucaCinquini July 10, 2025 16:53
@nikki-t nikki-t self-assigned this Jul 10, 2025
@nikki-t
Collaborator Author

nikki-t commented Jul 10, 2025

@LucaCinquini - There is one pending item. Once any DAGs are executed, the airflow-celery-workers node stays up. I don't know whether we want to define this as a core component so that it runs with the other pods, or try setting the consolidationPolicy to WhenEmptyOrUnderutilized so that Karpenter consolidates pods onto fewer nodes when a node's resources are underutilized.
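
For context, the consolidation option mentioned above is set in the disruption block of a Karpenter NodePool. The following is a rough sketch, assuming the Karpenter v1 API and the Terraform kubernetes_manifest resource; the NodePool name, the EC2NodeClass name, the single requirement, and the consolidateAfter value are hypothetical and may not match how this repository defines its node pools.

```hcl
# Sketch only. Names and values are illustrative, not taken from the PR.
resource "kubernetes_manifest" "celery_workers_nodepool" {
  manifest = {
    apiVersion = "karpenter.sh/v1"
    kind       = "NodePool"
    metadata = {
      name = "airflow-celery-workers" # hypothetical NodePool name
    }
    spec = {
      template = {
        spec = {
          nodeClassRef = {
            group = "karpenter.k8s.aws"
            kind  = "EC2NodeClass"
            name  = "default" # hypothetical EC2NodeClass
          }
          requirements = [
            { key = "karpenter.k8s.aws/instance-family", operator = "In", values = ["r5"] }
          ]
        }
      }
      disruption = {
        # WhenEmpty removes only nodes with no workload pods;
        # WhenEmptyOrUnderutilized also lets Karpenter move pods off
        # underused nodes so those nodes can be removed or replaced.
        consolidationPolicy = "WhenEmptyOrUnderutilized"
        consolidateAfter    = "1m" # assumed delay before consolidation is considered
      }
    }
  }
}
```

With this policy a worker node that is idle after a DAG run becomes a candidate for removal, at the cost of a cold start the next time a task needs a worker node.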

@LucaCinquini
Collaborator

@nikki-t : I tested the PR and it looks good. In steady state, the cluster has 3 nodes; 2 would be even better. Can you try your suggestion of using "WhenEmptyOrUnderutilized" for the celery-workers and see whether the 3rd node is shut down? Thanks.

@nikki-t
Collaborator Author

nikki-t commented Jul 14, 2025

@LucaCinquini - I will work on testing this today. For future reference, here are the details on consolidation: https://karpenter.sh/v1.0/concepts/disruption/#consolidation

@nikki-t
Collaborator Author

nikki-t commented Jul 14, 2025

@LucaCinquini - I pushed some changes that seem to have worked. The main thing we need to decide is whether it is okay for the DAG setup task to wait before it runs. With these changes, no celery worker node is available at first, so the task has to wait for a node to be launched and initialized.

@LucaCinquini
Collaborator

@nikki-t : I think your last changes are good. To finalize, can you add a comment to the template file that explains the implications of using 0 vs 1? By default, we should set the value to 0.

@nikki-t
Collaborator Author

nikki-t commented Jul 15, 2025

@LucaCinquini - I added a comment to the template file. Let me know if you want me to add any further details.

@nikki-t
Collaborator Author

nikki-t commented Jul 15, 2025

Not sure why the pre-commit GitHub action is failing; it seems unable to install a dependency.

Collaborator

@LucaCinquini LucaCinquini left a comment

Great job :) Approving and merging.

@LucaCinquini LucaCinquini merged commit 53f37c9 into develop Jul 15, 2025
2 checks passed
@LucaCinquini LucaCinquini deleted the 419-reduce-ops-costs branch July 15, 2025 13:10